Anthony Wilson

5/23/2020

Milestone 5

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from geonamescache import GeonamesCache 
from geonamescache.mappers import country
import plotly.express as px

Merging the data sets

In [2]:
# creating connection to database
con = sqlite3.connect('dsc_540_project.sqlite')

I will join my data together using SQLite. I will write a common table expression (CTE) to summarize each table and then join them all together. First I will write the query for each table individually to make sure it does what I expect; then I will use the CTEs to combine the data.
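As a preview of that plan, here is a minimal sketch of the CTE structure against an in-memory database. The table and column names mirror the project tables (COVID_API, WASH_DATA), but the rows are made up purely for illustration; the real queries below are run against dsc_540_project.sqlite.

```python
import sqlite3

import pandas as pd

# Toy in-memory database standing in for dsc_540_project.sqlite;
# the rows are invented for illustration only.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE COVID_API (COUNTRY TEXT, CONFIRMED INT)")
con.execute("CREATE TABLE WASH_DATA (CNTRY TEXT, PCT REAL)")
con.executemany("INSERT INTO COVID_API VALUES (?, ?)",
                [('Canada', 78331), ('Denmark', 11125)])
con.executemany("INSERT INTO WASH_DATA VALUES (?, ?)",
                [('Canada', 99.0), ('Denmark', 100.0)])

# Each CTE summarizes one table; the final SELECT joins them on country.
query = """
WITH covid AS (
    SELECT COUNTRY, SUM(CONFIRMED) AS CONFIRMED
    FROM COVID_API
    GROUP BY 1
),
wash AS (
    SELECT CNTRY, MAX(PCT) AS PCT
    FROM WASH_DATA
    GROUP BY 1
)
SELECT covid.COUNTRY, covid.CONFIRMED, wash.PCT
FROM covid
INNER JOIN wash ON covid.COUNTRY = wash.CNTRY;
"""
df = pd.read_sql_query(query, con)
print(df)
```

Using `pd.read_sql_query` returns a DataFrame directly, which is often more convenient than `cursor.fetchall()` for validation.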

I will first need to do some work so that the data integrates without duplicating rows on the joins. The COVID API data has some countries with more detailed information broken down by province or city. I will aggregate the cases by country and drop the geo coordinates, province, and city columns.

After looking at the data, it appears there are duplicate rows for the US and Canada (see the query below). We can either use the whole-country summary row or the summarized province/city rows for these countries. I validated this by looking up the numbers on the API site.

In [3]:
cursor = con.execute("""   SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            CASE WHEN PROVINCE IS NULL AND CITY IS NULL THEN 1 ELSE 0 END AS CK,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            and country in ('Canada', 'Denmark', 'France', 'Netherlands', 'United Kingdom', 'United States of America')
                            GROUP BY 1,2,3,4
                            
                            ;""")
rows = cursor.fetchall()
rows #validating the data
Out[3]:
[('Canada', 'CA', '2020-05-17', 0, 78331, 5902, 0, 72429),
 ('Canada', 'CA', '2020-05-17', 1, 78331, 5902, 38563, 33866),
 ('Denmark', 'DK', '2020-05-17', 0, 198, 0, 198, 0),
 ('Denmark', 'DK', '2020-05-17', 1, 10927, 547, 9227, 1153),
 ('France', 'FR', '2020-05-17', 0, 2453, 49, 1420, 984),
 ('France', 'FR', '2020-05-17', 1, 177240, 28062, 59907, 89271),
 ('Netherlands', 'NL', '2020-05-17', 0, 200, 19, 167, 14),
 ('Netherlands', 'NL', '2020-05-17', 1, 43995, 5680, 0, 38315),
 ('United Kingdom', 'GB', '2020-05-17', 0, 1300, 80, 1058, 162),
 ('United Kingdom', 'GB', '2020-05-17', 1, 243695, 34636, 0, 209059),
 ('United States of America',
  'US',
  '2020-05-17',
  0,
  1474638,
  88488,
  0,
  1386150),
 ('United States of America',
  'US',
  '2020-05-17',
  1,
  1486757,
  89562,
  272265,
  1124930)]

The US and Canada each have a summary row plus rows broken down by city and province. I will write one query that pulls the US and Canada rows where province and city are null, then union it with every country that is not the US or Canada.

In [4]:
cursor = con.execute("""   SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            AND COUNTRY IN ('Canada', 'United States of America')
                            AND PROVINCE IS NULL AND CITY IS NULL
                            GROUP BY 1,2,3
                            
                            UNION
                            
                            SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            AND COUNTRY NOT IN ('Canada', 'United States of America')
                            GROUP BY 1,2,3
                            
                            ;""")
rows = cursor.fetchall()
rows #validating the data
Out[4]:
[('Afghanistan', 'AF', '2020-05-17', 6664, 169, 778, 5717),
 ('Albania', 'AL', '2020-05-17', 946, 31, 715, 200),
 ('Algeria', 'DZ', '2020-05-17', 7019, 548, 3507, 2964),
 ('Andorra', 'AD', '2020-05-17', 761, 51, 617, 93),
 ('Angola', 'AO', '2020-05-17', 48, 2, 17, 29),
 ('Antigua and Barbuda', 'AG', '2020-05-17', 25, 3, 19, 3),
 ('Argentina', 'AR', '2020-05-17', 8068, 373, 2569, 5126),
 ('Armenia', 'AM', '2020-05-17', 4472, 60, 1925, 2487),
 ('Australia', 'AU', '2020-05-17', 7054, 99, 6392, 563),
 ('Austria', 'AT', '2020-05-17', 16242, 629, 14563, 1050),
 ('Azerbaijan', 'AZ', '2020-05-17', 3274, 39, 2015, 1220),
 ('Bahamas', 'BS', '2020-05-17', 96, 11, 42, 43),
 ('Bahrain', 'BH', '2020-05-17', 6956, 12, 2910, 4034),
 ('Bangladesh', 'BD', '2020-05-17', 22268, 328, 4373, 17567),
 ('Barbados', 'BB', '2020-05-17', 88, 7, 68, 13),
 ('Belarus', 'BY', '2020-05-17', 29650, 165, 9932, 19553),
 ('Belgium', 'BE', '2020-05-17', 55280, 9052, 14630, 31598),
 ('Belize', 'BZ', '2020-05-17', 18, 2, 16, 0),
 ('Benin', 'BJ', '2020-05-17', 339, 2, 83, 254),
 ('Bhutan', 'BT', '2020-05-17', 21, 0, 5, 16),
 ('Bolivia', 'BO', '2020-05-17', 4088, 169, 493, 3426),
 ('Bosnia and Herzegovina', 'BA', '2020-05-17', 2290, 133, 1436, 721),
 ('Botswana', 'BW', '2020-05-17', 25, 1, 17, 7),
 ('Brazil', 'BR', '2020-05-17', 241080, 16118, 94122, 130840),
 ('Brunei Darussalam', 'BN', '2020-05-17', 141, 1, 136, 4),
 ('Bulgaria', 'BG', '2020-05-17', 2211, 108, 598, 1505),
 ('Burkina Faso', 'BF', '2020-05-17', 796, 51, 644, 101),
 ('Burundi', 'BI', '2020-05-17', 23, 1, 15, 7),
 ('Cambodia', 'KH', '2020-05-17', 122, 0, 122, 0),
 ('Cameroon', 'CM', '2020-05-17', 3105, 140, 1567, 1398),
 ('Canada', 'CA', '2020-05-17', 78331, 5902, 38563, 33866),
 ('Cape Verde', 'CV', '2020-05-17', 328, 3, 84, 241),
 ('Central African Republic', 'CF', '2020-05-17', 327, 0, 13, 314),
 ('Chad', 'TD', '2020-05-17', 503, 53, 117, 333),
 ('Chile', 'CL', '2020-05-17', 43781, 450, 19213, 24118),
 ('China', 'CN', '2020-05-17', 84054, 4638, 79306, 110),
 ('Colombia', 'CO', '2020-05-17', 15574, 574, 3751, 11249),
 ('Comoros', 'KM', '2020-05-17', 11, 1, 3, 7),
 ('Congo (Brazzaville)', 'CG', '2020-05-17', 391, 15, 87, 289),
 ('Congo (Kinshasa)', 'CD', '2020-05-17', 1455, 61, 270, 1124),
 ('Costa Rica', 'CR', '2020-05-17', 863, 10, 565, 288),
 ('Croatia', 'HR', '2020-05-17', 2226, 95, 1936, 195),
 ('Cuba', 'CU', '2020-05-17', 1872, 79, 1495, 298),
 ('Cyprus', 'CY', '2020-05-17', 916, 17, 515, 384),
 ('Czech Republic', 'CZ', '2020-05-17', 8475, 298, 5462, 2715),
 ("Côte d'Ivoire", 'CI', '2020-05-17', 2109, 27, 1004, 1078),
 ('Denmark', 'DK', '2020-05-17', 11125, 547, 9425, 1153),
 ('Djibouti', 'DJ', '2020-05-17', 1401, 4, 972, 425),
 ('Dominica', 'DM', '2020-05-17', 16, 0, 16, 0),
 ('Dominican Republic', 'DO', '2020-05-17', 12314, 428, 5847, 6039),
 ('Ecuador', 'EC', '2020-05-17', 33182, 2736, 3433, 27013),
 ('Egypt', 'EG', '2020-05-17', 12229, 630, 3172, 8427),
 ('El Salvador', 'SV', '2020-05-17', 1338, 30, 464, 844),
 ('Equatorial Guinea', 'GQ', '2020-05-17', 594, 7, 22, 565),
 ('Eritrea', 'ER', '2020-05-17', 39, 0, 39, 0),
 ('Estonia', 'EE', '2020-05-17', 1774, 63, 938, 773),
 ('Ethiopia', 'ET', '2020-05-17', 317, 5, 113, 199),
 ('Fiji', 'FJ', '2020-05-17', 18, 0, 15, 3),
 ('Finland', 'FI', '2020-05-17', 6347, 298, 5000, 1049),
 ('France', 'FR', '2020-05-17', 179693, 28111, 61327, 90255),
 ('Gabon', 'GA', '2020-05-17', 1320, 11, 244, 1065),
 ('Gambia', 'GM', '2020-05-17', 23, 1, 12, 10),
 ('Georgia', 'GE', '2020-05-17', 695, 12, 425, 258),
 ('Germany', 'DE', '2020-05-17', 176369, 7962, 154011, 14396),
 ('Ghana', 'GH', '2020-05-17', 5735, 29, 1754, 3952),
 ('Greece', 'GR', '2020-05-17', 2834, 163, 1374, 1297),
 ('Grenada', 'GD', '2020-05-17', 22, 0, 14, 8),
 ('Guatemala', 'GT', '2020-05-17', 1763, 33, 138, 1592),
 ('Guinea', 'GN', '2020-05-17', 2658, 16, 1133, 1509),
 ('Guinea-Bissau', 'GW', '2020-05-17', 990, 4, 26, 960),
 ('Guyana', 'GY', '2020-05-17', 117, 10, 43, 64),
 ('Haiti', 'HT', '2020-05-17', 456, 20, 21, 415),
 ('Holy See (Vatican City State)', 'VA', '2020-05-17', 12, 0, 2, 10),
 ('Honduras', 'HN', '2020-05-17', 2646, 142, 319, 2185),
 ('Hungary', 'HU', '2020-05-17', 3509, 451, 1396, 1662),
 ('Iceland', 'IS', '2020-05-17', 1802, 10, 1786, 6),
 ('India', 'IN', '2020-05-17', 95698, 3025, 36795, 55878),
 ('Indonesia', 'ID', '2020-05-17', 17514, 1148, 4129, 12237),
 ('Iran, Islamic Republic of', 'IR', '2020-05-17', 120198, 6988, 94464, 18746),
 ('Iraq', 'IQ', '2020-05-17', 3404, 123, 2218, 1063),
 ('Ireland', 'IE', '2020-05-17', 24112, 1543, 19470, 3099),
 ('Israel', 'IL', '2020-05-17', 16617, 272, 12942, 3403),
 ('Italy', 'IT', '2020-05-17', 225435, 31908, 125176, 68351),
 ('Jamaica', 'JM', '2020-05-17', 520, 9, 127, 384),
 ('Japan', 'JP', '2020-05-17', 16285, 744, 11153, 4388),
 ('Jordan', 'JO', '2020-05-17', 613, 9, 408, 196),
 ('Kazakhstan', 'KZ', '2020-05-17', 6157, 34, 3256, 2867),
 ('Kenya', 'KE', '2020-05-17', 887, 50, 313, 524),
 ('Korea (South)', 'KR', '2020-05-17', 11065, 263, 9904, 898),
 ('Kuwait', 'KW', '2020-05-17', 14850, 112, 4093, 10645),
 ('Kyrgyzstan', 'KG', '2020-05-17', 1138, 14, 804, 320),
 ('Lao PDR', 'LA', '2020-05-17', 19, 0, 14, 5),
 ('Latvia', 'LV', '2020-05-17', 1008, 19, 662, 327),
 ('Lebanon', 'LB', '2020-05-17', 911, 26, 247, 638),
 ('Lesotho', 'LS', '2020-05-17', 1, 0, 0, 1),
 ('Liberia', 'LR', '2020-05-17', 226, 21, 120, 85),
 ('Libya', 'LY', '2020-05-17', 65, 3, 35, 27),
 ('Liechtenstein', 'LI', '2020-05-17', 82, 1, 55, 26),
 ('Lithuania', 'LT', '2020-05-17', 1541, 56, 997, 488),
 ('Luxembourg', 'LU', '2020-05-17', 3945, 107, 3702, 136),
 ('Macedonia, Republic of', 'MK', '2020-05-17', 1792, 101, 1293, 398),
 ('Madagascar', 'MG', '2020-05-17', 304, 1, 114, 189),
 ('Malawi', 'MW', '2020-05-17', 70, 3, 27, 40),
 ('Malaysia', 'MY', '2020-05-17', 6894, 113, 5571, 1210),
 ('Maldives', 'MV', '2020-05-17', 1094, 4, 58, 1032),
 ('Mali', 'ML', '2020-05-17', 860, 52, 494, 314),
 ('Malta', 'MT', '2020-05-17', 553, 6, 454, 93),
 ('Mauritania', 'MR', '2020-05-17', 62, 4, 7, 51),
 ('Mauritius', 'MU', '2020-05-17', 332, 10, 322, 0),
 ('Mexico', 'MX', '2020-05-17', 49219, 5177, 33329, 10713),
 ('Moldova', 'MD', '2020-05-17', 6060, 211, 2344, 3505),
 ('Monaco', 'MC', '2020-05-17', 96, 4, 87, 5),
 ('Mongolia', 'MN', '2020-05-17', 136, 0, 21, 115),
 ('Montenegro', 'ME', '2020-05-17', 324, 9, 311, 4),
 ('Morocco', 'MA', '2020-05-17', 6870, 192, 3660, 3018),
 ('Mozambique', 'MZ', '2020-05-17', 137, 0, 44, 93),
 ('Myanmar', 'MM', '2020-05-17', 184, 6, 97, 81),
 ('Namibia', 'NA', '2020-05-17', 16, 0, 13, 3),
 ('Nepal', 'NP', '2020-05-17', 295, 2, 36, 257),
 ('Netherlands', 'NL', '2020-05-17', 44195, 5699, 167, 38329),
 ('New Zealand', 'NZ', '2020-05-17', 1499, 21, 1433, 45),
 ('Nicaragua', 'NI', '2020-05-17', 25, 8, 7, 10),
 ('Niger', 'NE', '2020-05-17', 904, 54, 698, 152),
 ('Nigeria', 'NG', '2020-05-17', 5959, 182, 1594, 4183),
 ('Norway', 'NO', '2020-05-17', 8249, 232, 32, 7985),
 ('Oman', 'OM', '2020-05-17', 5186, 22, 1465, 3699),
 ('Pakistan', 'PK', '2020-05-17', 40151, 873, 11341, 27937),
 ('Palestinian Territory', 'PS', '2020-05-17', 381, 2, 335, 44),
 ('Panama', 'PA', '2020-05-17', 9606, 275, 6081, 3250),
 ('Papua New Guinea', 'PG', '2020-05-17', 8, 0, 8, 0),
 ('Paraguay', 'PY', '2020-05-17', 786, 11, 202, 573),
 ('Peru', 'PE', '2020-05-17', 92273, 2648, 28621, 61004),
 ('Philippines', 'PH', '2020-05-17', 12513, 824, 2635, 9054),
 ('Poland', 'PL', '2020-05-17', 18529, 925, 7451, 10153),
 ('Portugal', 'PT', '2020-05-17', 29036, 1218, 4636, 23182),
 ('Qatar', 'QA', '2020-05-17', 32604, 15, 4370, 28219),
 ('Republic of Kosovo', 'XK', '2020-05-17', 955, 29, 691, 235),
 ('Romania', 'RO', '2020-05-17', 16871, 1107, 9890, 5874),
 ('Russian Federation', 'RU', '2020-05-17', 281752, 2631, 67373, 211748),
 ('Rwanda', 'RW', '2020-05-17', 292, 0, 197, 95),
 ('Saint Kitts and Nevis', 'KN', '2020-05-17', 15, 0, 14, 1),
 ('Saint Lucia', 'LC', '2020-05-17', 18, 0, 18, 0),
 ('Saint Vincent and Grenadines', 'VC', '2020-05-17', 17, 0, 14, 3),
 ('San Marino', 'SM', '2020-05-17', 654, 41, 201, 412),
 ('Sao Tome and Principe', 'ST', '2020-05-17', 235, 7, 4, 224),
 ('Saudi Arabia', 'SA', '2020-05-17', 54752, 312, 25722, 28718),
 ('Senegal', 'SN', '2020-05-17', 2480, 25, 973, 1482),
 ('Serbia', 'RS', '2020-05-17', 10610, 230, 4713, 5667),
 ('Seychelles', 'SC', '2020-05-17', 11, 0, 10, 1),
 ('Sierra Leone', 'SL', '2020-05-17', 505, 32, 141, 332),
 ('Singapore', 'SG', '2020-05-17', 28038, 22, 9340, 18676),
 ('Slovakia', 'SK', '2020-05-17', 1494, 28, 1163, 303),
 ('Slovenia', 'SI', '2020-05-17', 1466, 104, 273, 1089),
 ('Somalia', 'SO', '2020-05-17', 1421, 56, 152, 1213),
 ('South Africa', 'ZA', '2020-05-17', 15515, 264, 7006, 8245),
 ('South Sudan', 'SS', '2020-05-17', 290, 4, 4, 282),
 ('Spain', 'ES', '2020-05-17', 230698, 27563, 146446, 56689),
 ('Sri Lanka', 'LK', '2020-05-17', 981, 9, 538, 434),
 ('Sudan', 'SD', '2020-05-17', 2289, 97, 222, 1970),
 ('Suriname', 'SR', '2020-05-17', 10, 1, 9, 0),
 ('Swaziland', 'SZ', '2020-05-17', 203, 2, 73, 128),
 ('Sweden', 'SE', '2020-05-17', 30143, 3679, 4971, 21493),
 ('Switzerland', 'CH', '2020-05-17', 30587, 1881, 27500, 1206),
 ('Syrian Arab Republic (Syria)', 'SY', '2020-05-17', 58, 3, 36, 19),
 ('Taiwan, Republic of China', 'TW', '2020-05-17', 440, 7, 395, 38),
 ('Tajikistan', 'TJ', '2020-05-17', 1524, 39, 0, 1485),
 ('Tanzania, United Republic of', 'TZ', '2020-05-17', 509, 21, 183, 305),
 ('Thailand', 'TH', '2020-05-17', 3028, 56, 2856, 116),
 ('Timor-Leste', 'TL', '2020-05-17', 24, 0, 24, 0),
 ('Togo', 'TG', '2020-05-17', 301, 11, 104, 186),
 ('Trinidad and Tobago', 'TT', '2020-05-17', 116, 8, 107, 1),
 ('Tunisia', 'TN', '2020-05-17', 1037, 45, 816, 176),
 ('Turkey', 'TR', '2020-05-17', 149435, 4140, 109962, 35333),
 ('Uganda', 'UG', '2020-05-17', 227, 0, 63, 164),
 ('Ukraine', 'UA', '2020-05-17', 18291, 514, 5116, 12661),
 ('United Arab Emirates', 'AE', '2020-05-17', 23358, 220, 8512, 14626),
 ('United Kingdom', 'GB', '2020-05-17', 244995, 34716, 1058, 209221),
 ('United States of America',
  'US',
  '2020-05-17',
  1486757,
  89562,
  272265,
  1124930),
 ('Uruguay', 'UY', '2020-05-17', 734, 20, 564, 150),
 ('Uzbekistan', 'UZ', '2020-05-17', 2753, 12, 2247, 494),
 ('Venezuela (Bolivarian Republic)', 'VE', '2020-05-17', 541, 10, 241, 290),
 ('Viet Nam', 'VN', '2020-05-17', 320, 0, 260, 60),
 ('Western Sahara', 'EH', '2020-05-17', 6, 0, 6, 0),
 ('Yemen', 'YE', '2020-05-17', 128, 20, 1, 107),
 ('Zambia', 'ZM', '2020-05-17', 753, 7, 188, 558),
 ('Zimbabwe', 'ZW', '2020-05-17', 44, 4, 17, 23)]
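The same summary-versus-detail split can be sketched in pandas. This toy frame stands in for the COVID_API table; it only assumes that the PROVINCE and CITY columns are missing (NaN) on the national summary rows, as the validation query above showed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for COVID_API: the US has both a national summary row
# (PROVINCE/CITY missing) and a state-level row; the numbers are
# invented for illustration.
covid = pd.DataFrame({
    'COUNTRY':   ['United States of America',
                  'United States of America',
                  'Denmark'],
    'PROVINCE':  [np.nan, 'New York', np.nan],
    'CITY':      [np.nan, np.nan, np.nan],
    'CONFIRMED': [1486757, 358000, 10927],
})

dup_countries = ['Canada', 'United States of America']
is_dup = covid['COUNTRY'].isin(dup_countries)
is_summary = covid['PROVINCE'].isna() & covid['CITY'].isna()

# Keep only the summary row for the duplicated countries, keep
# everything for the rest, then aggregate to one row per country.
deduped = pd.concat([covid[is_dup & is_summary], covid[~is_dup]])
totals = deduped.groupby('COUNTRY', as_index=False)['CONFIRMED'].sum()
print(totals)
```

This mirrors the SQL union: filter to summary rows for the duplicated countries, concatenate with the rest, and aggregate.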

For the WASH data in milestone two, I put together two data sets, one wide and one tall. The tall data set includes the country name, so I am using it to build a distinct lookup of country codes and country names, which adds the full country name to the wide table.

In [5]:
cursor = con.execute("""SELECT
                            B.CNTRY,
                            A.*
                            FROM WASH_DATA_WIDE A
                                INNER JOIN (
                                                SELECT DISTINCT CNTRY_CD,CNTRY --distinct list of country and country code
                                                FROM WASH_DATA
                                            ) AS B 
                                        ON A.CNTRY_CD = B.CNTRY_CD;""")
rows = cursor.fetchall()
rows #validating the data
Out[5]:
[('Armenia',
  'ARM',
  2016,
  2924816.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  69.0,
  31.0,
  0.0,
  0.0),
 ('Azerbaijan',
  'AZE',
  2016,
  9725376.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0),
 ('Bhutan',
  'BTN',
  2016,
  797765.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  57.14286,
  42.85714,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0),
 ('China',
  'CHN',
  2016,
  1403500416.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  35.7,
  0.0,
  63.91,
  0.39,
  35.7,
  0.0,
  63.91,
  0.39,
  0.0,
  100.0,
  0.0,
  0.0,
  35.7,
  0.0,
  63.91,
  0.39),
 ('Czechia',
  'CZE',
  2016,
  10610947.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0),
 ('Egypt',
  'EGY',
  2016,
  95688680.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  99.82700000000001,
  0.0,
  0.17300000000000001,
  0.0,
  99.8211,
  0.0,
  0.1789,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  99.84825,
  0.0,
  0.15175),
 ('Estonia',
  'EST',
  2016,
  1312442.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0),
 ('India',
  'IND',
  2016,
  1324171392.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  76.45623,
  0.0,
  0.0,
  23.543770000000002,
  0.0,
  49.79684,
  0.0,
  50.20316,
  0.0,
  60.9048,
  0.0,
  39.0952,
  0.0,
  37.6771,
  0.0,
  62.3229,
  0.0,
  58.266709999999996,
  0.0,
  41.733290000000004),
 ('Kyrgyzstan',
  'KGZ',
  2016,
  5955734.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  61.5385,
  38.4615,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0),
 ('Kuwait',
  'KWT',
  2016,
  4052584.0,
  100.0,
  0.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0),
 ('Liberia',
  'LBR',
  2016,
  4613823.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  36.0,
  64.0,
  0.0,
  0.0),
 ('Lithuania',
  'LTU',
  2016,
  2908249.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  99.0,
  1.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0),
 ('Maldives',
  'MDV',
  2016,
  427756.0,
  75.0,
  0.0,
  25.0,
  0.0,
  79.8913,
  0.0,
  20.1087,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  79.78723000000001,
  0.0,
  20.212770000000003,
  0.0),
 ('Montenegro',
  'MNE',
  2016,
  628615.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0),
 ('Nigeria',
  'NGA',
  2016,
  185989647.99999997,
  0.0,
  98.3988,
  0.0,
  1.6012,
  0.0,
  99.7561,
  0.0,
  0.2439,
  0.0,
  99.41175,
  0.0,
  0.58825,
  0.0,
  99.0324,
  0.0,
  0.9676,
  0.0,
  99.17275,
  0.0,
  0.82725,
  0.0,
  98.8889,
  0.0,
  1.1111,
  43.0,
  0.0,
  43.7893,
  13.2107),
 ('San Marino',
  'SMR',
  2016,
  33203.0,
  100.0,
  0.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0),
 ('Serbia',
  'SRB',
  2016,
  8820082.999999998,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  0.0,
  100.0,
  0.0,
  0.0,
  99.8,
  0.2,
  0.0,
  0.0),
 ('United Republic of Tanzania',
  'TZA',
  2016,
  55572200.0,
  60.8696,
  39.1304,
  0.0,
  0.0,
  20.5742,
  79.4258,
  0.0,
  0.0,
  58.3333,
  41.6667,
  0.0,
  0.0,
  32.8402,
  67.1598,
  0.0,
  0.0,
  24.2915,
  75.7085,
  0.0,
  0.0,
  56.5574,
  43.4426,
  0.0,
  0.0,
  34.8765,
  65.1235,
  0.0,
  0.0),
 ('Zimbabwe',
  'ZWE',
  2016,
  16150362.0,
  70.4,
  0.0,
  24.9,
  4.7,
  56.6,
  0.0,
  32.9,
  10.5,
  55.8,
  0.0,
  34.8,
  9.4,
  58.6,
  0.0,
  31.5,
  9.9,
  53.3,
  0.0,
  35.1,
  11.6,
  60.6,
  0.0,
  30.5,
  8.9,
  58.2,
  0.0,
  32.0,
  9.8)]

Milestone three is based on a Google search for COVID together with each country's full name. We will need to rename some columns so they do not overlap between the API and the Google web scrape; I will rename the counts from the web scrape to 'GS_' + the column name. The data will be summed by country, since we pulled the top news articles for that day and counted the number of times keywords were used.

In [6]:
cursor = con.execute("""SELECT COUNTRY, 
                              SUM(VIRUS) as GS_VIRUS, 
                              SUM(CURE) AS GS_CURE,
                              SUM(DEATH) AS GS_DEATH, 
                              SUM(CASES) AS GS_CASES, 
                              SUM(MONEY) AS GS_MONEY, 
                              SUM(DISTANCE) AS GS_DIST, 
                              SUM(DEMOGRAPHIC) AS GS_DEMO,
                              SUM(SUPPLIES) AS GS_SUPPLIES
                            FROM webScrape_Counts_2
                            GROUP BY 1;""")
rows = cursor.fetchall()
rows #validating the data
Out[6]:
[('Afghanistan', 58.0, 1.0, 12.0, 16.0, 19.0, 5.0, 37.0, 5.0),
 ('Albania', 75.0, 13.0, 4.0, 20.0, 11.0, 9.0, 36.0, 18.0),
 ('Algeria', 140.0, 15.0, 10.0, 34.0, 11.0, 12.0, 97.0, 47.0),
 ('Andorra', 49.0, 13.0, 9.0, 22.0, 9.0, 4.0, 26.0, 11.0),
 ('Angola', 50.0, 1.0, 8.0, 21.0, 8.0, 10.0, 57.0, 18.0),
 ('Antigua and Barbuda', 32.0, 2.0, 8.0, 17.0, 2.0, 8.0, 35.0, 23.0),
 ('Argentina', 65.0, 3.0, 7.0, 20.0, 55.0, 12.0, 59.0, 14.0),
 ('Armenia', 73.0, 1.0, 4.0, 32.0, 21.0, 0.0, 25.0, 4.0),
 ('Australia', 132.0, 4.0, 18.0, 33.0, 27.0, 28.0, 100.0, 51.0),
 ('Austria', 98.0, 15.0, 19.0, 58.0, 12.0, 33.0, 113.0, 32.0),
 ('Azerbaijan', 14.0, 2.0, 3.0, 10.0, 1.0, 9.0, 25.0, 6.0),
 ('Bahamas', 43.0, 3.0, 17.0, 30.0, 17.0, 13.0, 63.0, 25.0),
 ('Bahrain', 40.0, 6.0, 0.0, 43.0, 3.0, 11.0, 41.0, 7.0),
 ('Bangladesh', 48.0, 2.0, 14.0, 36.0, 22.0, 9.0, 55.0, 17.0),
 ('Barbados', 44.0, 0.0, 6.0, 27.0, 0.0, 7.0, 38.0, 13.0),
 ('Belarus', 63.0, 1.0, 16.0, 32.0, 6.0, 14.0, 69.0, 8.0),
 ('Belgium', 123.0, 4.0, 31.0, 34.0, 22.0, 16.0, 86.0, 20.0),
 ('Belize', 86.0, 5.0, 2.0, 133.0, 8.0, 6.0, 84.0, 28.0),
 ('Benin', 33.0, 1.0, 2.0, 15.0, 7.0, 2.0, 49.0, 4.0),
 ('Bhutan', 70.0, 3.0, 2.0, 27.0, 7.0, 7.0, 57.0, 11.0),
 ('Bolivia', 48.0, 4.0, 4.0, 15.0, 11.0, 5.0, 41.0, 17.0),
 ('Bosnia and Herzegovina', 51.0, 1.0, 2.0, 10.0, 8.0, 8.0, 36.0, 2.0),
 ('Botswana', 83.0, 8.0, 17.0, 54.0, 8.0, 12.0, 69.0, 18.0),
 ('Brazil', 71.0, 20.0, 3.0, 16.0, 5.0, 12.0, 63.0, 18.0),
 ('Brunei Darussalam', 181.0, 13.0, 2.0, 58.0, 22.0, 9.0, 45.0, 26.0),
 ('Bulgaria', 34.0, 0.0, 2.0, 8.0, 15.0, 5.0, 23.0, 21.0),
 ('Burkina Faso', 63.0, 0.0, 2.0, 29.0, 4.0, 5.0, 40.0, 10.0),
 ('Burundi', 53.0, 11.0, 3.0, 27.0, 5.0, 6.0, 66.0, 18.0),
 ('Cambodia', 89.0, 2.0, 10.0, 26.0, 1.0, 9.0, 61.0, 45.0),
 ('Cameroon', 118.0, 8.0, 15.0, 50.0, 11.0, 5.0, 58.0, 11.0),
 ('Canada', 109.0, 6.0, 10.0, 14.0, 17.0, 7.0, 76.0, 30.0),
 ('Cape Verde', 52.0, 3.0, 2.0, 21.0, 3.0, 5.0, 25.0, 19.0),
 ('Central African Republic', 40.0, 17.0, 0.0, 12.0, 4.0, 9.0, 29.0, 13.0),
 ('Chad', 64.0, 1.0, 2.0, 14.0, 5.0, 6.0, 70.0, 23.0),
 ('Chile', 96.0, 5.0, 7.0, 29.0, 43.0, 10.0, 90.0, 16.0),
 ('China', 84.0, 13.0, 10.0, 21.0, 19.0, 10.0, 35.0, 14.0),
 ('Colombia', 45.0, 5.0, 1.0, 12.0, 7.0, 9.0, 36.0, 2.0),
 ('Comoros', 62.0, 6.0, 14.0, 26.0, 4.0, 12.0, 49.0, 12.0),
 ('Congo (Brazzaville)', 119.0, 25.0, 8.0, 90.0, 17.0, 11.0, 65.0, 23.0),
 ('Congo (Kinshasa)', 39.0, 5.0, 2.0, 10.0, 6.0, 5.0, 20.0, 16.0),
 ('Costa Rica', 80.0, 32.0, 10.0, 5.0, 5.0, 2.0, 23.0, 19.0),
 ('Croatia', 35.0, 7.0, 2.0, 15.0, 9.0, 7.0, 71.0, 19.0),
 ('Cuba', 31.0, 5.0, 2.0, 4.0, 3.0, 3.0, 21.0, 9.0),
 ('Cyprus', 47.0, 2.0, 4.0, 42.0, 8.0, 4.0, 32.0, 3.0),
 ('Czech Republic', 66.0, 3.0, 5.0, 9.0, 8.0, 3.0, 55.0, 30.0),
 ("Côte d'Ivoire", 29.0, 15.0, 3.0, 17.0, 4.0, 2.0, 62.0, 8.0),
 ('Denmark', 50.0, 29.0, 5.0, 18.0, 2.0, 7.0, 34.0, 7.0),
 ('Djibouti', 155.0, 15.0, 19.0, 76.0, 11.0, 20.0, 101.0, 39.0),
 ('Dominica', 37.0, 1.0, 4.0, 5.0, 1.0, 6.0, 18.0, 22.0),
 ('Dominican Republic', 56.0, 9.0, 13.0, 23.0, 8.0, 2.0, 37.0, 30.0),
 ('Ecuador', 97.0, 8.0, 44.0, 17.0, 14.0, 18.0, 116.0, 34.0),
 ('Egypt', 90.0, 4.0, 10.0, 15.0, 6.0, 9.0, 47.0, 11.0),
 ('El Salvador', 35.0, 3.0, 1.0, 15.0, 3.0, 5.0, 17.0, 3.0),
 ('Equatorial Guinea', 20.0, 4.0, 1.0, 2.0, 2.0, 0.0, 7.0, 7.0),
 ('Eritrea', 25.0, 14.0, 3.0, 2.0, 0.0, 0.0, 11.0, 3.0),
 ('Estonia', 89.0, 5.0, 10.0, 48.0, 5.0, 9.0, 73.0, 28.0),
 ('Ethiopia', 85.0, 3.0, 5.0, 18.0, 4.0, 8.0, 83.0, 13.0),
 ('Fiji', 63.0, 17.0, 2.0, 16.0, 9.0, 6.0, 40.0, 34.0),
 ('Finland', 27.0, 20.0, 2.0, 9.0, 3.0, 1.0, 25.0, 4.0),
 ('France', 107.0, 5.0, 26.0, 28.0, 14.0, 22.0, 63.0, 21.0),
 ('Gabon', 58.0, 11.0, 5.0, 20.0, 7.0, 3.0, 37.0, 7.0),
 ('Gambia', 24.0, 2.0, 0.0, 4.0, 13.0, 3.0, 24.0, 0.0),
 ('Georgia', 59.0, 0.0, 17.0, 41.0, 21.0, 0.0, 66.0, 26.0),
 ('Germany', 60.0, 8.0, 14.0, 88.0, 12.0, 11.0, 65.0, 37.0),
 ('Ghana', 113.0, 2.0, 9.0, 16.0, 21.0, 9.0, 68.0, 30.0),
 ('Greece', 68.0, 3.0, 14.0, 29.0, 9.0, 8.0, 52.0, 10.0),
 ('Grenada', 59.0, 0.0, 6.0, 27.0, 4.0, 9.0, 42.0, 36.0),
 ('Guatemala', 37.0, 0.0, 2.0, 14.0, 8.0, 4.0, 67.0, 10.0),
 ('Guinea', 37.0, 15.0, 7.0, 4.0, 5.0, 2.0, 32.0, 13.0),
 ('Guyana', 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
 ('Haiti', 43.0, 0.0, 6.0, 8.0, 8.0, 9.0, 53.0, 9.0),
 ('Holy See (Vatican City State)', 64.0, 3.0, 8.0, 24.0, 7.0, 4.0, 64.0, 14.0),
 ('Honduras', 46.0, 5.0, 3.0, 21.0, 8.0, 6.0, 40.0, 8.0),
 ('Hungary', 93.0, 2.0, 10.0, 21.0, 13.0, 12.0, 62.0, 10.0),
 ('Iceland', 84.0, 11.0, 25.0, 29.0, 6.0, 6.0, 55.0, 36.0),
 ('India', 64.0, 10.0, 8.0, 21.0, 16.0, 17.0, 78.0, 30.0),
 ('Indonesia', 77.0, 14.0, 10.0, 42.0, 5.0, 5.0, 42.0, 16.0),
 ('Iran, Islamic Republic of', 53.0, 8.0, 11.0, 20.0, 12.0, 3.0, 37.0, 26.0),
 ('Iraq', 83.0, 7.0, 10.0, 13.0, 20.0, 8.0, 50.0, 90.0),
 ('Ireland', 57.0, 6.0, 6.0, 5.0, 8.0, 9.0, 74.0, 17.0),
 ('Israel', 62.0, 5.0, 4.0, 3.0, 5.0, 5.0, 29.0, 13.0),
 ('Italy', 47.0, 0.0, 22.0, 24.0, 4.0, 6.0, 53.0, 8.0),
 ('Jamaica', 68.0, 12.0, 6.0, 11.0, 22.0, 7.0, 26.0, 11.0),
 ('Japan', 59.0, 9.0, 7.0, 23.0, 5.0, 10.0, 34.0, 4.0),
 ('Jordan', 42.0, 0.0, 4.0, 11.0, 4.0, 11.0, 16.0, 5.0),
 ('Kazakhstan', 74.0, 0.0, 7.0, 23.0, 24.0, 6.0, 21.0, 12.0),
 ('Kenya', 68.0, 3.0, 7.0, 22.0, 4.0, 2.0, 46.0, 21.0),
 ('Korea (South)', 74.0, 1.0, 8.0, 29.0, 6.0, 4.0, 62.0, 25.0),
 ('Kuwait', 52.0, 8.0, 17.0, 61.0, 0.0, 1.0, 20.0, 5.0),
 ('Kyrgyzstan', 55.0, 3.0, 4.0, 12.0, 7.0, 12.0, 39.0, 23.0),
 ('Lao PDR', 102.0, 1.0, 2.0, 16.0, 13.0, 0.0, 80.0, 33.0),
 ('Latvia', 53.0, 0.0, 9.0, 3.0, 2.0, 6.0, 13.0, 6.0),
 ('Lebanon', 43.0, 2.0, 5.0, 18.0, 3.0, 3.0, 34.0, 16.0),
 ('Liberia', 82.0, 10.0, 2.0, 6.0, 17.0, 7.0, 67.0, 18.0),
 ('Libya', 42.0, 2.0, 2.0, 3.0, 2.0, 2.0, 38.0, 3.0),
 ('Liechtenstein', 31.0, 0.0, 7.0, 11.0, 19.0, 6.0, 23.0, 11.0),
 ('Lithuania', 9.0, 2.0, 3.0, 6.0, 3.0, 5.0, 22.0, 1.0),
 ('Luxembourg', 68.0, 6.0, 19.0, 17.0, 18.0, 2.0, 37.0, 10.0),
 ('Macedonia, Republic of', 62.0, 0.0, 1.0, 24.0, 10.0, 11.0, 49.0, 24.0),
 ('Madagascar', 15.0, 22.0, 5.0, 1.0, 9.0, 0.0, 15.0, 5.0),
 ('Malawi', 43.0, 3.0, 2.0, 2.0, 10.0, 3.0, 29.0, 2.0),
 ('Malaysia', 39.0, 1.0, 10.0, 22.0, 3.0, 2.0, 28.0, 7.0),
 ('Maldives', 112.0, 8.0, 9.0, 46.0, 5.0, 15.0, 67.0, 21.0),
 ('Mali', 46.0, 3.0, 5.0, 8.0, 4.0, 0.0, 48.0, 6.0),
 ('Malta', 41.0, 3.0, 4.0, 52.0, 0.0, 11.0, 73.0, 14.0),
 ('Mauritania', 35.0, 1.0, 0.0, 12.0, 6.0, 10.0, 30.0, 6.0),
 ('Mauritius', 25.0, 3.0, 3.0, 11.0, 9.0, 5.0, 33.0, 2.0),
 ('Mexico', 40.0, 1.0, 6.0, 11.0, 4.0, 4.0, 13.0, 16.0),
 ('Moldova', 9.0, 0.0, 2.0, 3.0, 3.0, 2.0, 10.0, 3.0),
 ('Monaco', 60.0, 3.0, 3.0, 8.0, 21.0, 10.0, 54.0, 23.0),
 ('Mongolia', 35.0, 0.0, 3.0, 4.0, 9.0, 1.0, 17.0, 13.0),
 ('Montenegro', 14.0, 0.0, 1.0, 1.0, 1.0, 4.0, 11.0, 0.0),
 ('Morocco', 45.0, 10.0, 3.0, 8.0, 2.0, 6.0, 24.0, 11.0),
 ('Mozambique', 15.0, 0.0, 1.0, 10.0, 4.0, 2.0, 18.0, 3.0),
 ('Myanmar', 83.0, 5.0, 6.0, 27.0, 1.0, 13.0, 52.0, 21.0),
 ('Namibia', 50.0, 3.0, 0.0, 13.0, 4.0, 4.0, 25.0, 11.0),
 ('Nepal', 74.0, 21.0, 6.0, 15.0, 9.0, 10.0, 69.0, 25.0),
 ('Netherlands', 39.0, 0.0, 1.0, 21.0, 2.0, 2.0, 15.0, 4.0),
 ('New Zealand', 49.0, 1.0, 6.0, 29.0, 1.0, 1.0, 82.0, 9.0),
 ('Nicaragua', 63.0, 5.0, 4.0, 19.0, 21.0, 8.0, 67.0, 20.0),
 ('Niger', 9.0, 1.0, 1.0, 11.0, 0.0, 1.0, 15.0, 1.0),
 ('Nigeria', 73.0, 5.0, 3.0, 17.0, 23.0, 10.0, 32.0, 27.0),
 ('Norway', 19.0, 0.0, 0.0, 2.0, 6.0, 2.0, 18.0, 2.0),
 ('Oman', 22.0, 1.0, 2.0, 8.0, 7.0, 3.0, 13.0, 5.0),
 ('Pakistan', 141.0, 14.0, 22.0, 47.0, 18.0, 13.0, 84.0, 33.0),
 ('Palestinian Territory', 44.0, 5.0, 1.0, 10.0, 5.0, 0.0, 22.0, 8.0),
 ('Panama', 44.0, 0.0, 3.0, 35.0, 3.0, 16.0, 63.0, 7.0),
 ('Papua New Guinea', 35.0, 0.0, 1.0, 6.0, 0.0, 4.0, 16.0, 10.0),
 ('Paraguay', 40.0, 0.0, 3.0, 3.0, 14.0, 2.0, 26.0, 13.0),
 ('Peru', 25.0, 3.0, 0.0, 8.0, 4.0, 2.0, 17.0, 13.0),
 ('Philippines', 47.0, 4.0, 6.0, 9.0, 15.0, 10.0, 64.0, 28.0),
 ('Poland', 8.0, 0.0, 4.0, 5.0, 1.0, 2.0, 21.0, 6.0),
 ('Portugal', 39.0, 6.0, 5.0, 9.0, 7.0, 7.0, 38.0, 4.0),
 ('Qatar', 32.0, 9.0, 0.0, 5.0, 0.0, 2.0, 16.0, 14.0),
 ('Republic of Kosovo', 33.0, 4.0, 1.0, 1.0, 2.0, 1.0, 26.0, 19.0),
 ('Romania', 55.0, 10.0, 5.0, 8.0, 5.0, 8.0, 66.0, 22.0),
 ('Russian Federation', 99.0, 9.0, 7.0, 35.0, 16.0, 4.0, 85.0, 24.0),
 ('Rwanda', 76.0, 3.0, 6.0, 13.0, 5.0, 4.0, 20.0, 6.0),
 ('Saint Kitts and Nevis', 32.0, 7.0, 3.0, 8.0, 7.0, 6.0, 30.0, 6.0),
 ('Saint Lucia', 28.0, 3.0, 1.0, 9.0, 5.0, 3.0, 30.0, 3.0),
 ('Saint Vincent and Grenadines', 32.0, 0.0, 1.0, 6.0, 0.0, 13.0, 20.0, 4.0),
 ('San Marino', 48.0, 2.0, 4.0, 24.0, 5.0, 6.0, 38.0, 11.0),
 ('Sao Tome and Principe', 49.0, 1.0, 11.0, 26.0, 4.0, 10.0, 40.0, 14.0),
 ('Saudi Arabia', 137.0, 13.0, 30.0, 47.0, 27.0, 19.0, 92.0, 18.0),
 ('Senegal', 38.0, 4.0, 0.0, 7.0, 6.0, 2.0, 32.0, 12.0),
 ('Serbia', 106.0, 4.0, 12.0, 25.0, 6.0, 6.0, 61.0, 23.0),
 ('Seychelles', 33.0, 4.0, 4.0, 11.0, 2.0, 8.0, 29.0, 7.0),
 ('Sierra Leone', 62.0, 0.0, 11.0, 13.0, 2.0, 3.0, 39.0, 7.0),
 ('Singapore', 69.0, 7.0, 17.0, 99.0, 11.0, 9.0, 71.0, 13.0),
 ('Slovakia', 123.0, 3.0, 10.0, 48.0, 13.0, 13.0, 83.0, 24.0),
 ('Slovenia', 45.0, 5.0, 5.0, 5.0, 14.0, 6.0, 42.0, 12.0),
 ('Somalia', 43.0, 0.0, 1.0, 13.0, 1.0, 6.0, 21.0, 11.0),
 ('South Africa', 41.0, 2.0, 4.0, 7.0, 11.0, 4.0, 20.0, 5.0),
 ('South Sudan', 54.0, 2.0, 10.0, 21.0, 13.0, 2.0, 37.0, 10.0),
 ('Spain', 126.0, 5.0, 29.0, 83.0, 14.0, 13.0, 74.0, 17.0),
 ('Sri Lanka', 52.0, 9.0, 4.0, 10.0, 32.0, 10.0, 42.0, 12.0),
 ('Sudan', 30.0, 3.0, 2.0, 6.0, 1.0, 1.0, 42.0, 0.0),
 ('Suriname', 22.0, 1.0, 2.0, 0.0, 4.0, 3.0, 21.0, 10.0),
 ('Swaziland', 48.0, 3.0, 4.0, 7.0, 0.0, 3.0, 38.0, 12.0),
 ('Sweden', 28.0, 3.0, 2.0, 9.0, 9.0, 7.0, 57.0, 4.0),
 ('Switzerland', 80.0, 7.0, 29.0, 75.0, 10.0, 4.0, 46.0, 9.0),
 ('Syrian Arab Republic (Syria)', 42.0, 0.0, 3.0, 5.0, 3.0, 5.0, 37.0, 14.0),
 ('Taiwan, Republic of China', 79.0, 3.0, 16.0, 24.0, 11.0, 9.0, 39.0, 16.0),
 ('Tajikistan', 40.0, 4.0, 4.0, 9.0, 2.0, 8.0, 31.0, 8.0),
 ('Tanzania, United Republic of', 60.0, 5.0, 5.0, 14.0, 6.0, 9.0, 76.0, 22.0),
 ('Thailand', 53.0, 3.0, 3.0, 15.0, 40.0, 1.0, 31.0, 5.0),
 ('Timor-Leste', 15.0, 1.0, 4.0, 8.0, 2.0, 2.0, 10.0, 2.0),
 ('Togo', 25.0, 1.0, 3.0, 3.0, 14.0, 12.0, 35.0, 24.0),
 ('Trinidad and Tobago', 42.0, 0.0, 10.0, 34.0, 3.0, 10.0, 45.0, 12.0),
 ('Tunisia', 33.0, 0.0, 1.0, 6.0, 2.0, 4.0, 30.0, 3.0),
 ('Turkey', 37.0, 8.0, 4.0, 3.0, 5.0, 6.0, 27.0, 19.0),
 ('Uganda', 39.0, 8.0, 5.0, 12.0, 9.0, 3.0, 33.0, 8.0),
 ('Ukraine', 35.0, 4.0, 1.0, 4.0, 1.0, 8.0, 22.0, 2.0),
 ('United Arab Emirates', 37.0, 6.0, 1.0, 0.0, 13.0, 0.0, 22.0, 8.0),
 ('United Kingdom', 67.0, 1.0, 7.0, 17.0, 6.0, 5.0, 36.0, 7.0),
 ('United States of America',
  101.0,
  23.0,
  21.0,
  38.0,
  21.0,
  18.0,
  114.0,
  29.0),
 ('Uruguay', 123.0, 1.0, 20.0, 63.0, 27.0, 15.0, 82.0, 40.0),
 ('Uzbekistan', 22.0, 3.0, 2.0, 2.0, 10.0, 17.0, 28.0, 7.0),
 ('Venezuela (Bolivarian Republic)',
  25.0,
  8.0,
  7.0,
  14.0,
  10.0,
  4.0,
  70.0,
  30.0),
 ('Viet Nam', 69.0, 4.0, 4.0, 24.0, 5.0, 11.0, 82.0, 22.0),
 ('Yemen', 54.0, 6.0, 14.0, 22.0, 6.0, 6.0, 47.0, 17.0),
 ('Zambia', 34.0, 0.0, 9.0, 31.0, 7.0, 5.0, 22.0, 17.0),
 ('Zimbabwe', 47.0, 2.0, 1.0, 10.0, 1.0, 2.0, 47.0, 22.0)]

As we start to join the data, we need to ensure that we are getting everything and that the joins behave appropriately. Below, I take a distinct count of the countries listed in each table and union the counts together, so we can see how many countries each table returns. This is done to ensure all the data is being pulled in.
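The same per-table sanity check can be sketched in pandas. This is only an illustration with toy frames standing in for the three tables; the notebook itself runs the check in SQL against the SQLite connection.

```python
import pandas as pd

# Toy stand-ins for the three tables (illustrative only).
covid = pd.DataFrame({'COUNTRY': ['Canada', 'France', 'France']})
wash = pd.DataFrame({'CNTRY': ['Canada', 'France']})
gs = pd.DataFrame({'COUNTRY': ['Canada']})

# Distinct-country and row counts per table, mirroring the SQL UNION of counts.
counts = pd.DataFrame(
    [(name, frame.iloc[:, 0].nunique(), len(frame))
     for name, frame in [('COVID', covid), ('WASH', wash), ('GS', gs)]],
    columns=['T_NAME', 'CNTRY', 'ROW_COUNT'])
print(counts)
```

If the distinct-country count differs from the row count for a table, that table has duplicate country rows that would fan out on a join.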

In [7]:
cursor = con.execute("""SELECT'COVID' AS T_NAME, COUNT(DISTINCT COUNTRY) AS CNTRY, COUNT(1) AS ROW_COUNT
                        FROM (
                             SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            AND COUNTRY IN ('Canada', 'United States of America')
                            AND PROVINCE IS NULL AND CITY IS NULL
                            GROUP BY 1,2,3
                            
                            UNION
                            
                            SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            AND COUNTRY NOT IN ('Canada', 'United States of America')
                            GROUP BY 1,2,3  
                        )
                        GROUP BY 1
                        UNION
                        SELECT'WASH' AS T_NAME, COUNT(DISTINCT CNTRY) AS CNTRY, COUNT(1) AS ROW_COUNT
                        FROM (
                             SELECT
                                B.CNTRY,
                                A.*
                            FROM WASH_DATA_WIDE A
                                INNER JOIN (
                                                SELECT DISTINCT CNTRY_CD,CNTRY --distinct list of country and country code
                                                FROM WASH_DATA
                                            ) AS B 
                                        ON A.CNTRY_CD = B.CNTRY_CD
                        ) AS X
                        GROUP BY 1
                        UNION
                        SELECT'GS' AS T_NAME, COUNT(DISTINCT COUNTRY) AS CNTRY, COUNT(1) AS ROW_COUNT
                        FROM (
                            SELECT COUNTRY, 
                                  SUM(VIRUS) as GS_VIRUS, 
                                  SUM(CURE) AS GS_CURE,
                                  SUM(DEATH) AS GS_DEATH, 
                                  SUM(CASES) AS GS_CASES, 
                                  SUM(MONEY) AS GS_MONEY, 
                                  SUM(DISTANCE) AS GS_DIST, 
                                  SUM(DEMOGRAPHIC) AS GS_DEMO,
                                  SUM(SUPPLIES) AS GS_SUPPLIES
                            FROM webScrape_Counts_2
                            GROUP BY 1
                            ) AS X
                         GROUP BY 1
                        
                        
                      ;""")
rows = cursor.fetchall()
rows
Out[7]:
[('COVID', 186, 186), ('GS', 183, 183), ('WASH', 19, 19)]

Now let's join them all together to see what is returned.

In [8]:
cursor = con.execute("""WITH  COVID AS (
                            SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            AND COUNTRY IN ('Canada', 'United States of America')
                            AND PROVINCE IS NULL AND CITY IS NULL
                            GROUP BY 1,2,3
                            
                            UNION
                            
                            SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            AND COUNTRY NOT IN ('Canada', 'United States of America')
                            GROUP BY 1,2,3  
                        ),


                        WASH_DATA_W AS (
                            SELECT
                                B.CNTRY,
                                A.*
                            FROM WASH_DATA_WIDE A
                                INNER JOIN (
                                                SELECT DISTINCT CNTRY_CD,CNTRY --distinct list of country and country code
                                                FROM WASH_DATA
                                            ) AS B 
                                        ON A.CNTRY_CD = B.CNTRY_CD
                        ), 
                        
                        GS_SEARCH AS (
                            SELECT COUNTRY, 
                                  SUM(VIRUS) as GS_VIRUS, 
                                  SUM(CURE) AS GS_CURE,
                                  SUM(DEATH) AS GS_DEATH, 
                                  SUM(CASES) AS GS_CASES, 
                                  SUM(MONEY) AS GS_MONEY, 
                                  SUM(DISTANCE) AS GS_DIST, 
                                  SUM(DEMOGRAPHIC) AS GS_DEMO,
                                  SUM(SUPPLIES) AS GS_SUPPLIES
                            FROM webScrape_Counts_2
                            GROUP BY 1
                        
                        
                        )
                        
                        
                        SELECT COUNT(DISTINCT A.COUNTRY), COUNT(DISTINCT B.CNTRY), COUNT(DISTINCT C.COUNTRY)
                        FROM COVID A 
                            LEFT JOIN WASH_DATA_W AS B 
                                ON UPPER(A.COUNTRY) = UPPER(B.CNTRY)
                            LEFT JOIN GS_SEARCH C 
                                ON UPPER(A.COUNTRY) = UPPER(C.COUNTRY)
                        ;""")
rows = cursor.fetchall()
rows
Out[8]:
[(186, 17, 183)]

It looks like the covid and gs data are all matching, but we are missing 2 countries from the wash data. We can dig into this by reversing the join: put the wash data first and left join to covid, keeping only the rows where covid.country is null. When this is done, the unmatched countries are Czechia and United Republic of Tanzania, which we can compare to the country names listed above.
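This "left join, keep rows where the right side is null" pattern is an anti-join, and it can be sketched in pandas with `merge(..., indicator=True)`. The frames below are toy stand-ins, not the project tables:

```python
import pandas as pd

# Toy stand-ins (illustrative only).
wash = pd.DataFrame({'CNTRY': ['Canada', 'Czechia', 'United Republic of Tanzania']})
covid = pd.DataFrame({'COUNTRY': ['Canada', 'Czech Republic']})

# Case-insensitive join keys, mirroring the UPPER() comparison in the SQL.
wash['key'] = wash['CNTRY'].str.upper()
covid['key'] = covid['COUNTRY'].str.upper()

# Left join, then keep only rows with no match on the right (the anti-join).
missing = wash.merge(covid, on='key', how='left', indicator=True)
missing = missing[missing['_merge'] == 'left_only']
print(missing['CNTRY'].tolist())
```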

In [9]:
cursor = con.execute("""WITH  COVID AS (
                             SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            AND COUNTRY IN ('Canada', 'United States of America')
                            AND PROVINCE IS NULL AND CITY IS NULL
                            GROUP BY 1,2,3
                            
                            UNION
                            
                            SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            AND COUNTRY NOT IN ('Canada', 'United States of America')
                            GROUP BY 1,2,3  
                        ),


                        WASH_DATA_W AS (
                            SELECT
                                B.CNTRY,
                                A.*
                            FROM WASH_DATA_WIDE A
                                INNER JOIN (
                                                SELECT DISTINCT CNTRY_CD,CNTRY --distinct list of country and country code
                                                FROM WASH_DATA
                                            ) AS B 
                                        ON A.CNTRY_CD = B.CNTRY_CD
                        ), 
                        
                        GS_SEARCH AS (
                            SELECT COUNTRY, 
                                  SUM(VIRUS) as GS_VIRUS, 
                                  SUM(CURE) AS GS_CURE,
                                  SUM(DEATH) AS GS_DEATH, 
                                  SUM(CASES) AS GS_CASES, 
                                  SUM(MONEY) AS GS_MONEY, 
                                  SUM(DISTANCE) AS GS_DIST, 
                                  SUM(DEMOGRAPHIC) AS GS_DEMO,
                                  SUM(SUPPLIES) AS GS_SUPPLIES
                            FROM webScrape_Counts_2
                            GROUP BY 1
                        
                        
                        )
                        
                        
                        SELECT  A.CNTRY, A.CNTRY_CD
                        FROM WASH_DATA_W A 
                            LEFT JOIN COVID AS B 
                                ON UPPER(B.COUNTRY) = UPPER(A.CNTRY)
                            LEFT JOIN GS_SEARCH C 
                                ON UPPER(A.CNTRY) = UPPER(C.COUNTRY)
                        where B.COUNTRY IS NULL
                        ;""")
rows = cursor.fetchall()
rows
Out[9]:
[('Czechia', 'CZE'), ('United Republic of Tanzania', 'TZA')]

Referencing the output from above, Czechia is the same country as Czech Republic, and United Republic of Tanzania is the same as Tanzania, United Republic of; both are just worded a little differently. I will standardize the names with a CASE WHEN statement. Notice that below we now get the full distinct counts from each data set.
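The same name standardization could also be done on the pandas side with a small replacement map. This is a sketch on a toy frame; the notebook itself applies the fix with the SQL CASE WHEN:

```python
import pandas as pd

# Map the wash-data spellings onto the covid API spellings.
name_fixes = {
    'Czechia': 'Czech Republic',
    'United Republic of Tanzania': 'Tanzania, United Republic of',
}
wash = pd.DataFrame({'CNTRY': ['Canada', 'Czechia', 'United Republic of Tanzania']})
wash['CNTRY'] = wash['CNTRY'].replace(name_fixes)
print(wash['CNTRY'].tolist())
```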

In [10]:
cursor = con.execute("""WITH  COVID AS (
                             SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            AND COUNTRY IN ('Canada', 'United States of America')
                            AND PROVINCE IS NULL AND CITY IS NULL
                            GROUP BY 1,2,3
                            
                            UNION
                            
                            SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            AND COUNTRY NOT IN ('Canada', 'United States of America')
                            GROUP BY 1,2,3  
                        ),


                        WASH_DATA_W AS (
                            SELECT
                                B.CNTRY,
                                A.*
                            FROM WASH_DATA_WIDE A
                                INNER JOIN (
                                                SELECT DISTINCT CNTRY_CD
                                                                ,
                                                                CASE WHEN CNTRY_CD = 'CZE' THEN 'Czech Republic' 
                                                                     WHEN CNTRY_CD = 'TZA' THEN 'Tanzania, United Republic of' 
                                                                ELSE CNTRY END AS CNTRY 
                                                                --distinct list of country and country code
                                                FROM WASH_DATA
                                            ) AS B 
                                        ON A.CNTRY_CD = B.CNTRY_CD
                        ), 
                        
                        GS_SEARCH AS (
                            SELECT COUNTRY, 
                                  SUM(VIRUS) as GS_VIRUS, 
                                  SUM(CURE) AS GS_CURE,
                                  SUM(DEATH) AS GS_DEATH, 
                                  SUM(CASES) AS GS_CASES, 
                                  SUM(MONEY) AS GS_MONEY, 
                                  SUM(DISTANCE) AS GS_DIST, 
                                  SUM(DEMOGRAPHIC) AS GS_DEMO,
                                  SUM(SUPPLIES) AS GS_SUPPLIES
                            FROM webScrape_Counts_2
                            GROUP BY 1
                        
                        
                        )
                        
                        
                        SELECT COUNT(DISTINCT A.COUNTRY), COUNT(DISTINCT B.CNTRY), COUNT(DISTINCT C.COUNTRY)
                        FROM COVID A 
                            LEFT JOIN WASH_DATA_W AS B 
                                ON UPPER(A.COUNTRY) = UPPER(B.CNTRY)
                            LEFT JOIN GS_SEARCH C 
                                ON UPPER(A.COUNTRY) = UPPER(C.COUNTRY)
                        ;""")
rows = cursor.fetchall()
rows
Out[10]:
[(186, 19, 183)]

Now we can look at the final output and create the final table. I reordered the joins: since the google search data had more matches on country than the wash data, I will put the wash data last.

In [11]:
# drop the table if it already exists, so the CREATE TABLE below can be rerun
cursor = con.execute("""DROP TABLE IF EXISTS COVID_DATA_FINAL 
                        ;""")
In [12]:
cursor = con.execute("""CREATE TABLE COVID_DATA_FINAL AS 
                        WITH  COVID AS (
                             SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            AND COUNTRY IN ('Canada', 'United States of America')
                            AND PROVINCE IS NULL AND CITY IS NULL
                            GROUP BY 1,2,3
                            
                            UNION
                            
                            SELECT COUNTRY ,
                            COUNTRYCODE,
                            DATE ,
                            SUM(CONFIRMED) AS CONFIRMED,
                            SUM(DEATHS) AS DEATHS,
                            SUM(RECOVERED) AS RECOVERED,
                            SUM(ACTIVE) AS ACTIVE
                            FROM COVID_API
                            WHERE DATE = '2020-05-17'
                            AND COUNTRY NOT IN ('Canada', 'United States of America')
                            GROUP BY 1,2,3   
                        ),


                        WASH_DATA_W AS (
                            SELECT
                                B.CNTRY,
                                A.*
                            FROM WASH_DATA_WIDE A
                                INNER JOIN (
                                                SELECT DISTINCT CNTRY_CD
                                                                ,
                                                                CASE WHEN CNTRY_CD = 'CZE' THEN 'Czech Republic' 
                                                                     WHEN CNTRY_CD = 'TZA' THEN 'Tanzania, United Republic of' 
                                                                ELSE CNTRY END AS CNTRY 
                                                                --distinct list of country and country code
                                                FROM WASH_DATA
                                            ) AS B 
                                        ON A.CNTRY_CD = B.CNTRY_CD
                        ), 
                        
                        GS_SEARCH AS (
                            SELECT COUNTRY, 
                                  SUM(VIRUS) AS GS_VIRUS, 
                                  SUM(CURE) AS GS_CURE,
                                  SUM(DEATH) AS GS_DEATH, 
                                  SUM(CASES) AS GS_CASES, 
                                  SUM(MONEY) AS GS_MONEY, 
                                  SUM(DISTANCE) AS GS_DIST, 
                                  SUM(DEMOGRAPHIC) AS GS_DEMO,
                                  SUM(SUPPLIES) AS GS_SUPPLIES
                            FROM webScrape_Counts_2
                            GROUP BY 1
                        
                        
                        )
                        
                        
                        SELECT A.*,
                                B.GS_VIRUS, 
                                B.GS_CURE,
                                B.GS_DEATH, 
                                B.GS_CASES, 
                                B.GS_MONEY, 
                                B.GS_DIST, 
                                B.GS_DEMO,
                                B.GS_SUPPLIES,
                                C.PPLTN, 
                                C.total_Basic_service, 
                                C.total_Insufficient_data, 
                                C.total_Limited_service, 
                                C.total_No_service 
                                
                              

                        FROM COVID A 
                            LEFT JOIN GS_SEARCH B 
                                ON UPPER(A.COUNTRY) = UPPER(B.COUNTRY)
                            LEFT JOIN WASH_DATA_W AS C 
                                ON UPPER(A.COUNTRY) = UPPER(C.CNTRY)
                            
                        ;""")
In [13]:
df = pd.read_sql_query('Select * from COVID_DATA_FINAL', con)

Double-checking the null values, which are expected after merging the data.

In [14]:
# double checking null values as they are expected by merging the data, also making the columns lowercase
df.columns = df.columns.str.lower() # setting all columns to lower case
for i in df.columns: 
    c = df[i].isna().sum() # count of nulls in the column
    print('columns {} has {} null values'.format(i, c))
columns country has 0 null values
columns countrycode has 0 null values
columns date has 0 null values
columns confirmed has 0 null values
columns deaths has 0 null values
columns recovered has 0 null values
columns active has 0 null values
columns gs_virus has 3 null values
columns gs_cure has 3 null values
columns gs_death has 3 null values
columns gs_cases has 3 null values
columns gs_money has 3 null values
columns gs_dist has 3 null values
columns gs_demo has 3 null values
columns gs_supplies has 3 null values
columns ppltn has 167 null values
columns total_basic_service has 167 null values
columns total_insufficient_data has 167 null values
columns total_limited_service has 167 null values
columns total_no_service has 167 null values

As expected, the nulls all fall in the numeric columns, so I will set them to zeros.

In [15]:
df1 = df.fillna(0)
df1.head(10)
Out[15]:
country countrycode date confirmed deaths recovered active gs_virus gs_cure gs_death gs_cases gs_money gs_dist gs_demo gs_supplies ppltn total_basic_service total_insufficient_data total_limited_service total_no_service
0 Afghanistan AF 2020-05-17 6664 169 778 5717 58.0 1.0 12.0 16.0 19.0 5.0 37.0 5.0 0.0 0.0 0.0 0.0 0.0
1 Albania AL 2020-05-17 946 31 715 200 75.0 13.0 4.0 20.0 11.0 9.0 36.0 18.0 0.0 0.0 0.0 0.0 0.0
2 Algeria DZ 2020-05-17 7019 548 3507 2964 140.0 15.0 10.0 34.0 11.0 12.0 97.0 47.0 0.0 0.0 0.0 0.0 0.0
3 Andorra AD 2020-05-17 761 51 617 93 49.0 13.0 9.0 22.0 9.0 4.0 26.0 11.0 0.0 0.0 0.0 0.0 0.0
4 Angola AO 2020-05-17 48 2 17 29 50.0 1.0 8.0 21.0 8.0 10.0 57.0 18.0 0.0 0.0 0.0 0.0 0.0
5 Antigua and Barbuda AG 2020-05-17 25 3 19 3 32.0 2.0 8.0 17.0 2.0 8.0 35.0 23.0 0.0 0.0 0.0 0.0 0.0
6 Argentina AR 2020-05-17 8068 373 2569 5126 65.0 3.0 7.0 20.0 55.0 12.0 59.0 14.0 0.0 0.0 0.0 0.0 0.0
7 Armenia AM 2020-05-17 4472 60 1925 2487 73.0 1.0 4.0 32.0 21.0 0.0 25.0 4.0 2924816.0 69.0 31.0 0.0 0.0
8 Australia AU 2020-05-17 7054 99 6392 563 132.0 4.0 18.0 33.0 27.0 28.0 100.0 51.0 0.0 0.0 0.0 0.0 0.0
9 Austria AT 2020-05-17 16242 629 14563 1050 98.0 15.0 19.0 58.0 12.0 33.0 113.0 32.0 0.0 0.0 0.0 0.0 0.0

Let’s double-check to make sure we no longer have missing values.

In [16]:
for i in df1.columns: 
    c = df1[i].isna().sum() # count of nulls in the column
    print('columns {} has {} null values'.format(i, c))
columns country has 0 null values
columns countrycode has 0 null values
columns date has 0 null values
columns confirmed has 0 null values
columns deaths has 0 null values
columns recovered has 0 null values
columns active has 0 null values
columns gs_virus has 0 null values
columns gs_cure has 0 null values
columns gs_death has 0 null values
columns gs_cases has 0 null values
columns gs_money has 0 null values
columns gs_dist has 0 null values
columns gs_demo has 0 null values
columns gs_supplies has 0 null values
columns ppltn has 0 null values
columns total_basic_service has 0 null values
columns total_insufficient_data has 0 null values
columns total_limited_service has 0 null values
columns total_no_service has 0 null values

The geonamescache library has a mapper function that converts a country name, or other country-related data, to its ISO 3 country code. We will use this to create some of our visualizations.

In [17]:
mapper = country(from_key='iso', to_key='iso3')
In [18]:
df1['iso3'] = df1.countrycode.apply(mapper) # map the 2-letter code to its ISO 3 code
df1.head()
Out[18]:
country countrycode date confirmed deaths recovered active gs_virus gs_cure gs_death ... gs_money gs_dist gs_demo gs_supplies ppltn total_basic_service total_insufficient_data total_limited_service total_no_service iso3
0 Afghanistan AF 2020-05-17 6664 169 778 5717 58.0 1.0 12.0 ... 19.0 5.0 37.0 5.0 0.0 0.0 0.0 0.0 0.0 AFG
1 Albania AL 2020-05-17 946 31 715 200 75.0 13.0 4.0 ... 11.0 9.0 36.0 18.0 0.0 0.0 0.0 0.0 0.0 ALB
2 Algeria DZ 2020-05-17 7019 548 3507 2964 140.0 15.0 10.0 ... 11.0 12.0 97.0 47.0 0.0 0.0 0.0 0.0 0.0 DZA
3 Andorra AD 2020-05-17 761 51 617 93 49.0 13.0 9.0 ... 9.0 4.0 26.0 11.0 0.0 0.0 0.0 0.0 0.0 AND
4 Angola AO 2020-05-17 48 2 17 29 50.0 1.0 8.0 ... 8.0 10.0 57.0 18.0 0.0 0.0 0.0 0.0 0.0 AGO

5 rows × 21 columns

Checking to see if the ISO 3 mapping failed for any countries.

In [19]:
df1[df1['iso3'].isna()]
Out[19]:
country countrycode date confirmed deaths recovered active gs_virus gs_cure gs_death ... gs_money gs_dist gs_demo gs_supplies ppltn total_basic_service total_insufficient_data total_limited_service total_no_service iso3

0 rows × 21 columns

Now that we have merged the data, we are going to add four new columns. I want to see the percentage of active, death, and recovered cases relative to the number of confirmed cases. I also want to sum up all the keywords found by the web scrape.

  • active_pct = active / confirmed
  • death_pct = deaths / confirmed
  • recovered_pct = recovered / confirmed
  • gs_total = sum of all the keywords pulled from the news articles

In [20]:
df1['active_pct'] = df1['active']/df1['confirmed']
df1['death_pct'] = df1['deaths']/df1['confirmed']
df1['recovered_pct'] = df1['recovered']/df1['confirmed']
df1['gs_total'] = df1.gs_virus + df1.gs_cure + df1.gs_death + df1.gs_cases + df1.gs_money + df1.gs_dist + df1.gs_demo + df1.gs_supplies
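One caveat with these ratio columns: a country reporting zero confirmed cases would yield NaN or inf from the division. A guarded version could look like the sketch below; this runs on a toy frame and is not what the notebook executed.

```python
import pandas as pd

# Toy frame: the second row has zero confirmed cases.
demo = pd.DataFrame({'active': [10, 0], 'confirmed': [100, 0]})
# Keep the ratio only where confirmed > 0; otherwise fall back to 0.
demo['active_pct'] = (demo['active'] / demo['confirmed']).where(demo['confirmed'] > 0, 0.0)
print(demo['active_pct'].tolist())
```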

Visualizations

Hygiene Coverage vs Covid Death %

The first visualization is a bubble plot of the wash data compared to the death percentage from the covid virus. We are going to look at total basic service, total insufficient data, total limited service, and total no service. These percentages represent hygiene coverage within schools, homes, and health care facilities: basic means the country has the essentials for hygiene, limited means there is some coverage, and no service means there is none.

For the size of the bubbles, we will use the population provided in the wash data set. This shows the number of people potentially affected by the level of hygiene coverage.

In [21]:
df2 = df1[(df1.total_basic_service > 0) | (df1.total_insufficient_data> 0)| (df1.total_limited_service > 0) | (df1.total_no_service > 0)]
In [22]:
plt.figure(figsize=(10,6)) # making graph easier to read
f = sns.scatterplot(x = "total_basic_service", y="death_pct",
                     data=df2
                , label = 'Basic' # used to identify basic service
                , size = 'ppltn' # used to create bubbles
                )
f = sns.scatterplot(x="total_insufficient_data", y="death_pct"
                    , data=df2
                    , label = 'Insufficient Data' # used to identify insufficient data
                    , size = 'ppltn' # used to create bubbles
                )
f = sns.scatterplot(x="total_limited_service", y="death_pct"
                    , data=df2
                    , label = 'Limited' # used to identify Limited service
                    , size = 'ppltn' # used to create bubbles
                )
f = sns.scatterplot(x="total_no_service", y="death_pct"
                    , data=df2
                    , label = 'No Service' # used to identify No service
                    , size = 'ppltn' # used to create bubbles               
                )
# pulling handles and labels so we don't duplicate in the legend
handles, labels = f.get_legend_handles_labels() 
plt.xlabel('Hygiene Service %')
plt.ylabel('Covid Death %')
plt.title('Hygiene Coverage vs Covid Death %')
# handles/labels[0::6] pulls the category label from each scatterplot 
# handles/labels[1:6] pulls the population (bubble size) entries for the legend
plt.legend( handles[0::6] + handles[1:6] , labels[0::6] + labels[1:6] , bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Out[22]:
<matplotlib.legend.Legend at 0x1e155f737c8>

Pair Plot Covid Percent's and Key Word from Web Scrape

There is so much data within these three data sets that it is hard to visualize all of it. A pair plot comparing the percentages from the covid API data with the sum total of all keywords may give some insight into relationships between the two data sets.

In [23]:
sns.pairplot(df1[df1.columns[21:]]); 

Choropleth maps of COVID-19

The last three visualizations show the spread of the virus across the world. They are interactive maps where you can zoom in and hover over each country to see the number of cases reported as of 5/17/2020.

    Visualizations
  1. Confirmed Cases
  2. Active Cases
  3. Deaths

In [33]:
import plotly.io as pio
pio.renderers.default='notebook'
fig = px.choropleth(df1
                    , locations="iso3" # maps to the country on the map
                    , color="confirmed"
                    , hover_name="country"  # add so user is able to get the actual values
                    , height = 400  # used so graph is more readable
                    # creating color transition by the number of reported cases
                    , color_continuous_scale= [(0, "cyan"), (0.10, "blue"), (1, "indigo")]
                    , title  = 'Confirmed Cases')
# make the figure visually more appealing
fig.update_layout(
    margin=dict(l=20, r=20, t=40, b=20)  # margins
    , paper_bgcolor="lightgray" # background color
)
fig.show("notebook");
In [34]:
fig = px.choropleth(df1
                    , locations="iso3"  # maps to the country on the map
                    , color="active"
                    , hover_name="country" # add so user is able to get the actual values
                    , height = 400 # used so graph is more readable
                    # creating color transition by the number of reported cases
                    , color_continuous_scale= [(0, "lightgoldenrodyellow"), (0.10, "olive"), (1, "green")]
                    , title = 'Active Cases')
# make the figure visually more appealing
fig.update_layout(
    margin=dict(l=20, r=20, t=40, b=20) # margins
    , paper_bgcolor="lightgray" # background color
)
fig.show("notebook");
In [35]:
fig = px.choropleth(df1
                    , locations="iso3" # maps to the country on the map
                    , color="deaths"
                    , hover_name="country" # add so user is able to get the actual values
                    , height = 400
                     # creating color transition by the number of reported cases
                    , color_continuous_scale= [(0, "bisque"), (0.10, "orange"), (1, "red")]
                    , title = 'Deaths Reported'
                   )
# make the figure visually more appealing
fig.update_layout(
    margin=dict(l=20, r=20, t=40, b=20) # margins
    , paper_bgcolor="lightgray") # background color
fig.show("notebook");

Summary

The project was fascinating, starting with the research. There is a lot of data out there relating to the coronavirus. Deciding on the right data sets and being able to merge them appropriately was a little difficult. I wanted to show some visualizations that had maps, something more than the normal bar chart, histogram, and scatterplots. What I didn’t realize is that in order to do this, I needed to have certain geographical data. For example, some libraries require latitude and longitude, universal transverse Mercator points, or a specific country code. Since the data was set up to merge based on country, I used plotly and the three-digit ISO country code.

The hygiene coverage (WASH) data provided by WHO/UNICEF came as CSV files. I combined these files into one data set. The cleaning and validation were straightforward. The problem with the WASH data is that its coverage is fairly limited: only 19 countries are represented, out of the 186 provided by the covid19 API data.
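The file-combination step can be sketched with the standard library; the column layout below is illustrative, not the actual WASH schema:

```python
import csv
import io

def combine_csvs(csv_texts):
    """Concatenate CSV contents that share a common header row.

    Keeps the header from the first file and appends only the data rows
    from the rest (an illustrative stand-in for merging the WASH files).
    """
    combined = []
    for i, text in enumerate(csv_texts):
        rows = list(csv.reader(io.StringIO(text)))
        combined.extend(rows if i == 0 else rows[1:])  # skip repeated headers
    return combined
```

Reading each downloaded file with `open(path).read()` and passing the strings in order yields one table with a single header row, ready to load into pandas.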

The most rewarding part of this project was creating the web scrape. I decided to pull data from Google News. I created a function to perform a Google search on the coronavirus together with a country name passed to the function. The result returned web addresses for news articles from Google. I then scraped those news articles for keywords related to the coronavirus. This pushed the envelope for me and my ability as a data scientist.
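The keyword step of that scrape can be sketched as a simple counter over article text. The keyword list and the tag-stripping regex below are simplifying assumptions; a real scrape would use a proper HTML parser such as BeautifulSoup:

```python
import re
from collections import Counter

# Assumed keyword list for illustration; the real scrape used its own set.
KEYWORDS = {"coronavirus", "covid", "pandemic", "lockdown"}

def count_keywords(html):
    """Crudely strip tags, then count occurrences of each keyword."""
    text = re.sub(r"<[^>]+>", " ", html).lower()  # naive tag removal
    words = re.findall(r"[a-z]+", text)
    return Counter(w for w in words if w in KEYWORDS)
```

Summing these counters across all articles returned for one country gives a per-country keyword tally that can be merged with the other tables.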

The API data was already pretty clean. When merging the datasets, it became evident that I should not use the geo coordinates for this project. The WASH data was limited in its results, and the web scrape didn't have data at the city or province level. The API data did not have complete province and city data either. Trying to create a map from only partial geocoordinates and locations seemed beyond the scope of this project. To keep the reporting consistent, I decided to summarize the data by country name. This gave the maps a more complete look by filling in most countries with color.
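The country-level rollup can be sketched with an in-memory sqlite3 database. The table and numbers below are made up for illustration, but the GROUP BY mirrors the aggregation run against the real COVID_API table earlier in the notebook:

```python
import sqlite3

# In-memory stand-in for the project database.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE covid_api (country TEXT, province TEXT, confirmed INTEGER)"
)
con.executemany(
    "INSERT INTO covid_api VALUES (?, ?, ?)",
    [("Canada", "Ontario", 100), ("Canada", "Quebec", 50), ("Denmark", None, 30)],
)
# Collapse province-level rows into one total per country.
rows = con.execute(
    "SELECT country, SUM(confirmed) FROM covid_api "
    "GROUP BY country ORDER BY country"
).fetchall()
```

After the rollup, `rows` holds one record per country, which is exactly the shape the choropleths above expect.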

The results, based on the information above, may indicate that we need more data. In visualization one, "Hygiene Coverage vs Covid Death %", I am not seeing any significant patterns; the points are scattered sporadically across the groupings.

In visualization two, "Pair Plot Covid Percents and Key Words from Web Scrape", there are no specific patterns worth pointing out. The scatterplot of recovered percent vs. active percent makes sense: if you had the virus and currently don't, then you are probably either recovered or dead. The death percent is interesting in that a lot of countries report 0 deaths.

Visualizations three through five show the spread of the virus. The US appears to have been hit the hardest, based on what has been reported. The findings could indicate that countries may not all be reporting completely on the virus, or that we need more sources to gain a better understanding.
